Abstract

This paper addresses the problem of automated vehicle
tracking and recognition from aerial image sequences.
Motivated by their success in the existing literature, we focus
on the use of linear appearance subspaces to describe
multi-view object appearance and highlight the challenges
involved in their application as part of a practical system.
A working solution which includes steps for data extraction
and normalization is described. In experiments on real-world
data the proposed methodology achieved promising
results with a high correct recognition rate and few, meaningful
errors (type II errors whereby genuinely similar targets
are sometimes confused with one another). Directions
for future research and possible improvements of
the proposed method are discussed.


1. Introduction 

In this paper we address the problem of automatic recognition
of ground objects (mainly vehicles) from image sequences
acquired from unmanned aircraft. This is a very
challenging recognition scenario in which viewpoint and
illumination can vary greatly, occlusions and background
clutter are common, and data is of low quality. Generally
speaking, target recognition systems comprise two
distinct tasks: (i) representing object appearance
and (ii) subsequently matching the representations. Both of
these tasks are pervasive across different object recognition
and matching problems, and have attracted a significant
amount of research attention in the computer vision community.
This interest has particularly intensified
in recent years, after significant advances towards practically
viable systems were made.

The most prominent group of approaches is local
descriptor-based. Methods of this group employ descriptors
in a sparse fashion by focusing on a set of automatically
localized interest points. When the number of
detected interest points is large this approach can achieve
impressive robustness to partial occlusion and pose.
However, a serious limitation of this approach is that it
cannot deal well with untextured objects (sometimes referred
to as smooth objects) which by their very nature
do not exhibit appearance that yields a large number
of consistently well-localized interest points. This limitation
has recently attracted increased research attention;
shape-based approaches using boundary appearance features
or pure shape have demonstrated promising
results on databases of objects which have distinct shapes.
Other approaches include part-based methods, suited to
the recognition of articulated objects with distinctly recognizable
parts (such as a face, which can be seen as comprising
two ‘eye parts’, a ‘mouth part’ etc. which vary in mutual
configuration depending on the person’s pose and facial
expression). A summary of different representation and
matching techniques dominant in the existing literature is
shown in Tables 1 and 2.

There are two key conceptual contributions we make to
the current state-of-the-art. Firstly, most existing methods
address the problem at hand using individual images. In
contrast, in this paper the focus is on recognition from image
sequences. In other words, a sequence of frames/images of
an unknown, query object is matched against a database of
sequences of known, database targets. This problem setting
is of increasing significance considering the ease with
which image sequences can be acquired and stored in our
application. Secondly, we show how a robust and pose-invariant
system can be built by enriching the set of directly
acquired exemplars with synthetically generated data, and
how the resulting sets can be appropriately described and
matched. We first summarize the key ideas and then explain
each element of our method in detail in the next section.
Table 1. Classification of the most influential object recognition
representations.

Global representations
Appearance prototypes
Model-based
Pictorial structure
Local representations
Dense local features
Sparse local features


Table 2. Classification of the most influential representation
matching approaches.

Single-view
Euclidean distance
Cosine distance
Single-view aggregation-based
Multi-view
Probability density-based
Embedded manifold-based


Approach summary. Considering the small scale of the
target, model or part-based approaches are usually unsuitable
for the problem in question. Notwithstanding its limitations,
in this paper we focus on the use of a multi-view
model in the form of a set of raw holistic appearance
patches. This model motivates increasingly popular
manifold-based approaches for matching, which exploit the
structure of the object’s appearance change as viewing parameters
are varied. Following its success in related recognition
problems and the potential for real-time performance in an
incremental learning framework due to its computational efficiency
and low storage requirements, here we specifically
examine the use of canonical correlation analysis.


2. Method details 


In this section we explain each of the constituent elements
of the proposed algorithm. We start by summarizing
the motivation and theory underlying the approach we adopt
for matching sets of exemplars. Then we explain how exemplar
data is extracted from raw imagery: Section 2.2.1 describes
how a detected target is tracked through a sequence
of images while possibly undergoing pose change, while
Section 2.2.2 explains how viewpoint invariance is achieved
by enriching the explicitly extracted set of exemplars with
additional, synthetically generated images.


2.1. Overview of the baseline approach 

For classification, canonical correlation analysis is
usually employed by computing canonical correlations (i.e.
the cosines of principal angles) between linear manifolds.
Canonical correlations, with the corresponding principal angles
$0 \le \theta_1 \le \ldots \le \theta_d \le \pi/2$ between any two
$d$-dimensional linear manifolds or hyperplanes $\mathcal{L}_1$ and
$\mathcal{L}_2$, are uniquely defined as:

$$\cos\theta_i = \max_{u_i \in \mathcal{L}_1} \max_{v_i \in \mathcal{L}_2} u_i^T v_i \qquad (1)$$

subject to $u_i^T u_i = v_i^T v_i = 1$, $u_i^T u_j = v_i^T v_j = 0$, $i \neq j$.
The solution can be obtained by applying the Singular Value
Decomposition (SVD), whose complexity is $O(d^3)$
where $d$ is the dimensionality of the manifolds ($d$ is typically
small). Each image set is represented by a linear manifold
and the angles between two low-dimensional manifolds
are exploited as a similarity measure between two image
sets. The canonical vectors in each pair are visually similar
despite the large changes of object pose. The first pair of
canonical vectors corresponds to the most correlated vectors,
each of which lies in the span of the respective image set.
The next pairs of canonical vectors
represent the directions of the next most similar data variations
of the two sets in other dimensions. CCA effectively
finds common modes (e.g. target object pose) of two image
sets.
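The computation in (1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, each image set is assumed to be stored as a matrix with one vectorized image per column, and no mean subtraction is applied (the text does not state whether the data is centred).

```python
import numpy as np

def orthonormal_basis(X, d):
    """Return a (D, d) orthonormal basis of the dominant d-dimensional
    subspace of the columns of X, where X is (D, n) with one image per column."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :d]

def canonical_correlations(B1, B2):
    """Cosines of the principal angles between two subspaces given by
    orthonormal bases B1 and B2, in decreasing order."""
    # The singular values of B1^T B2 are exactly cos(theta_1) ... cos(theta_d).
    s = np.linalg.svd(B1.T @ B2, compute_uv=False)
    # Clip to guard against values marginally outside [0, 1] due to rounding.
    return np.clip(s, 0.0, 1.0)

# Toy example in R^3: span{e1, e2} against span{e1, e2 + e3}.
X1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
X2 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
B1 = orthonormal_basis(X1, 2)
B2 = orthonormal_basis(X2, 2)
correlations = canonical_correlations(B1, B2)
```

In the toy example the two planes share the direction e1 and are otherwise tilted by 45 degrees, so the first canonical correlation is 1 and the second is cos 45°. The SVD is taken of a small d x d matrix, which is why the matching cost depends on the subspace dimensionality rather than on the image size.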

2.2. Data extraction 

In designing a practical object recognition system it is
of crucial importance to appreciate that the actual recognition
algorithm operates in the context of the preceding data
acquisition and extraction stages. Specifically, in the task
of target recognition from aerial images acquired using unmanned
aircraft, the target needs to be located and tracked.
On the lowest level, this is done by the on-board control
system which employs the GPS coordinates of the target
and the plane, and a gyroscope. Difficulties are already
introduced here: the target moves widely across the image,
as shown in Figure 1, often even entirely disappearing
from the view. Given that vehicles appear in aerial images
mostly as smooth objects, resulting in few stably detectable
keypoints, we found local-feature-based tracking unreliable.
Instead we propose an alternative solution which
comprises two steps. Firstly, the problem of initializing
tracking by detection is solved by registering consecutive
image frames globally. This is readily achieved because
video sequences of interest in this paper mostly comprise
static backgrounds. Hence in the context of registration
of frames the target can be considered an outlier
(in that it moves from frame to frame). After consecutive
frames are registered, the target itself is easily detected
by performing simple background subtraction. To account
for noise, morphological image processing is performed to
remove spurious regions of frame-to-frame difference, with
the correct region reliably being detected as the one with
the largest contiguous area over which the aforementioned
difference exceeds a threshold.
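The detection step can be illustrated with a short NumPy sketch. It assumes that the two frames are already globally registered greyscale arrays, and it covers only the thresholding and largest-region selection; the morphological cleaning is left out. The function name and the threshold value are hypothetical, and 4-connectivity is an assumption made for the example.

```python
from collections import deque
import numpy as np

def largest_changed_region(prev, curr, threshold):
    """Boolean mask of the largest 4-connected region in which the absolute
    frame-to-frame difference exceeds the threshold."""
    changed = np.abs(curr.astype(float) - prev.astype(float)) > threshold
    visited = np.zeros(changed.shape, dtype=bool)
    h, w = changed.shape
    best = []
    for r0, c0 in zip(*np.nonzero(changed)):
        if visited[r0, c0]:
            continue
        # Breadth-first flood fill of one connected region.
        visited[r0, c0] = True
        region, queue = [], deque([(r0, c0)])
        while queue:
            r, c = queue.popleft()
            region.append((r, c))
            for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= rr < h and 0 <= cc < w and changed[rr, cc] and not visited[rr, cc]:
                    visited[rr, cc] = True
                    queue.append((rr, cc))
        if len(region) > len(best):
            best = region
    mask = np.zeros(changed.shape, dtype=bool)
    if best:
        rows, cols = zip(*best)
        mask[np.array(rows), np.array(cols)] = True
    return mask

# Toy example: a 3 x 3 moving blob and one isolated noisy pixel.
prev = np.zeros((8, 8))
curr = np.zeros((8, 8))
curr[2:5, 2:5] = 10.0
curr[7, 7] = 10.0
mask = largest_changed_region(prev, curr, threshold=5.0)
```

Selecting the largest region already discards the isolated pixel in this example; on real imagery a morphological opening before the region search would serve the noise-removal role described above.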



Figure 1. Raw input frame showing a target poorly localized by
the aircraft’s on-board system.


2.2.1 Target tracking 

Following the localization of the target, we track it until it
is no longer entirely visible. For this we employ the well
known Lucas-Kanade tracker, with 6 affine degrees
of freedom. In this algorithm, at each image-to-image transition
in a sequence, the generalized position vector is iteratively
updated by minimization of the error (i.e. appearance
difference in the Euclidean distance sense) between the region
of an image at that position and the warped template of the
target from the preceding image in the sequence. In more
detail, given the appearance of the target $I_i(x, y)$ in the $i$-th
image in a sequence, the corresponding region of interest in
the subsequent frame $I_{i+1}$ is found by minimizing:

$$\sum_{(x,y) \in \mathcal{R}} \left[ I_{i+1}(W(x,y;p)) - I_i(x,y) \right]^2 \qquad (2)$$

where $\mathcal{R}$ is the quadrilateral region of interest specifying the
target in $I_i$, $W$ the warping function (an affine warp in our
case), and $p$ the warp parameters. The optimal $p$ is found
through iterative descent by linearizing $I_{i+1}(W(x,y;p))$
using Taylor expansion, giving:

$$\sum_{(x,y) \in \mathcal{R}} \left[ I_{i+1}(W(x,y;p_j)) + \nabla I \frac{\partial W}{\partial p} \Delta p - I_i(x,y) \right]^2 \qquad (3)$$

where $p_0$ are the initial warp parameters, and $p_1 \ldots p_j$ the
iterative refinement sequence. We found this approach to
work well on our data set and expect a similar level of performance
on imagery acquired under similar conditions (elevation
and angle to the target).
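For reference, the linearized objective (3) is quadratic in $\Delta p$ and so has a closed-form minimizer. The update below follows the standard Gauss-Newton treatment of Lucas-Kanade tracking and is given as a sketch under that assumption, not as a statement of the exact implementation used here; the symbol $H$ is introduced for illustration, and $\nabla I$ is read as the gradient of $I_{i+1}$ evaluated at $W(x,y;p_j)$.

```latex
% Setting the derivative of (3) with respect to \Delta p to zero gives:
\Delta p = H^{-1} \sum_{(x,y) \in \mathcal{R}}
  \left[ \nabla I \frac{\partial W}{\partial p} \right]^T
  \left[ I_i(x,y) - I_{i+1}(W(x,y;p_j)) \right],
\qquad
H = \sum_{(x,y) \in \mathcal{R}}
  \left[ \nabla I \frac{\partial W}{\partial p} \right]^T
  \left[ \nabla I \frac{\partial W}{\partial p} \right],
\qquad
p_{j+1} = p_j + \Delta p .
```

With 6 affine degrees of freedom, $\nabla I \, \partial W / \partial p$ is a 1 x 6 row vector at each pixel and $H$ is a 6 x 6 matrix, so each iteration requires only a small linear solve once the sums over the region of interest have been accumulated.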


2.2.2 Appearance set generation 


Ideally, the target recognition system receives views of the
target across the full 360° range, obtained by a circular reconnaissance
manoeuvre over the target. However, as a consequence
of the difficulties involved in the camera’s control
system locking onto the target (see Section 2.2), the range
of views obtained from each flyover is incomplete. Furthermore,
in the experiments reported in this paper, it was
decided to use each tracking burst (from the initial detection
of the target until the target is lost, as described in
Section 2.2.1) as a single training set, thus restricting the
range of views available for training. The reason for this
lies in the problem of concatenating the aforementioned tracking
bursts into one data set without introducing training data
artefacts. These include, for example, repeated views of the
target or differing aeroplane viewpoint angle due to multiple
flyovers.

The described fragmentation of input video sequences is
a serious problem as the baseline method of interest offers
no invariance in this regard. As described in Section 2.1,
for canonical correlations to extract meaningful similarity
between two data sets, a common form of variation must exist.
While the approach is robust to the presence of dissimilar data,
whether due to different viewing conditions or noise, true
rotational or view invariance is not inherent in it.

Instead of attempting to achieve invariant matching, in
this paper we examine an alternative approach whereby invariance
is achieved explicitly, by synthetic data augmentation.
We summarize the key stages in the proposed method:

1. Re-warping: Our tracking approach involves the estimation
of pose parameters for the target. To quasi-normalize
this view, we “re-warp” the target to fit its
initial pose (i.e. the pose in the first frame as explained
in Section 2.2), as shown in Figure 2.

2. Full view generation: The “re-warped” image of the
target is now synthetically rotated across 360° at 10°
intervals, as shown in Figure 3.

3. Subspace estimation: Synthetic views from all
tracked templates are compiled and used to estimate
a linear subspace, which is used as the final representation
of the target’s appearance in the video.
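Stages 2 and 3 above can be sketched as follows, assuming the patch has already been re-warped to the initial pose. The sketch is illustrative only: the function names are hypothetical, nearest-neighbour sampling with zero fill outside the patch is an assumption (the interpolation used is not stated), and the subspace dimensionality `d` is a free parameter that the text does not specify.

```python
import numpy as np

def rotate_patch(patch, angle_deg):
    """Rotate a 2-D patch about its centre using nearest-neighbour sampling.
    Output pixels whose source falls outside the patch are set to zero."""
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    t = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel find the source location.
    xr = np.cos(t) * (xs - cx) + np.sin(t) * (ys - cy) + cx
    yr = -np.sin(t) * (xs - cx) + np.cos(t) * (ys - cy) + cy
    xi, yi = np.rint(xr).astype(int), np.rint(yr).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(patch)
    out[valid] = patch[yi[valid], xi[valid]]
    return out

def synthetic_views(patch, step_deg=10):
    """Synthetic rotations of one patch covering 360 degrees."""
    return [rotate_patch(patch, a) for a in range(0, 360, step_deg)]

def appearance_subspace(patches, d):
    """Orthonormal basis (D, d) of the dominant linear subspace of a list of
    equally sized patches, each vectorized into one column."""
    X = np.stack([p.ravel().astype(float) for p in patches], axis=1)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :d]

# Toy example: one small random patch standing in for a tracked template.
rng = np.random.default_rng(0)
patch = rng.random((5, 5))
views = synthetic_views(patch)        # 36 views at 10 degree intervals
basis = appearance_subspace(views, 3)
```

For a full sequence, the views generated from every tracked template would be concatenated into a single list before calling `appearance_subspace`, giving one subspace per tracking burst.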


Figure 2. The first step of our data processing and normalization
involves “re-warping” each tracked patch to fit the initial pose.

Figure 3. Seven synthetically generated views of the target, produced
by rotating the “re-warped” patch (see Figure 2) across
different angles.


3. Empirical evaluation

In the empirical evaluation reported in this paper, we
used data acquired during five flights and multiple target
flyovers. The details of each session are summarized in Table 3.

To test the proposed recognition system we used six different
targets extracted from these five flights. Three views
of each and their symbolic names are shown in Figure 4.
The scale of each target varied throughout the input video, depending
on its geometry (significantly different for a car and
a tent, for example), camera viewpoint, and aeroplane distance
(mainly affected by its altitude). The size of the target
image patch diagonal lay in the range of 70 to 200 pixels.
We normalize tracked target patches by warping them to the
uniform scale of 100 x 100 pixels.

Table 3. A summary of data obtained during the five flights which
were used for empirical evaluation.

Flight   Date               Images   Resolution
1        28 October 2009    28364    1360x1024
2        06 November 2009   34262    1360x1024
3        10 November 2009   17636    1360x1024
4        12 November 2009   14076    1360x1024
5        17 November 2009   21108    1360x1024

Figure 4. Six different targets, shown in three different poses each,
used to test the proposed algorithm.


3.1. Results 

Target recognition was performed by matching a novel,
query data set against training sets of each of the six targets.
The query was identified as the target that it matched with
the highest confidence. The results of rank-1 recognition
are summarized using the confusion matrix in Table 4. As
the confusion matrix illustrates, the proposed method correctly
identified the novel target in all but a few cases. An
inspection of incorrect target assignments readily shows a
clear structure of such errors: targets with genuinely similar
appearance are occasionally confused with one another
(e.g. car 2 and van 2, which are both dark, of similar shape,
and differ only slightly in size, a difference which is not readily
apparent from images considering that the camera-target
distances are unknown).

Table 4. The target confusion matrix obtained in our experiments
(shown is error rate in %). Our algorithm made few errors (wrong
target assignments) and the few that were made can be seen to
correspond to genuinely similar objects.

         Car 1   Tent   Van 1   Car 2   Van 2   Car 3   Total
Car 1      -     0.0     3.4     0.0     0.0     4.7     7.0
Tent      0.0     -      0.0     0.0     0.0     0.0     0.0
Van 1     2.2    0.0      -      0.0     0.0     1.5     3.7
Car 2     0.0    0.0     0.0      -      1.9     0.0     1.9
Van 2     0.0    0.0     0.0     2.9      -      0.0     2.9
Car 3     6.7    0.0     0.8     0.0     0.0      -      7.5
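The rank-1 decision rule and an error-rate matrix of the kind shown in Table 4 can be sketched as below. The sketch takes the query-to-target similarity scores as given, since how the canonical correlations are combined into a single confidence value is not specified in the text. The function name is hypothetical, and normalizing each row by the number of queries of that true target is an assumption about how the percentages are computed.

```python
import numpy as np

def rank1_confusion(similarity, true_labels, n_targets):
    """similarity: (n_queries, n_targets) array of matching confidences.
    Returns the rank-1 predictions and a matrix whose entry (t, p), t != p,
    is the percentage of queries of true target t assigned to target p."""
    predicted = np.argmax(similarity, axis=1)
    counts = np.zeros((n_targets, n_targets))
    for t, p in zip(true_labels, predicted):
        counts[t, p] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    rates = 100.0 * counts / np.maximum(row_totals, 1)
    np.fill_diagonal(rates, 0.0)  # keep only the errors, as in an error-rate table
    return predicted, rates

# Toy example with two targets and four queries.
similarity = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.6, 0.4],
                       [0.7, 0.3]])
true_labels = [0, 1, 1, 0]
predicted, rates = rank1_confusion(similarity, true_labels, 2)
```

Summing a row of `rates` gives the total error rate for that target under the row-normalization assumption.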

We further tested the sensitivity of our method to the
amount of training data. As explained in Section 2.2.2, in
principle even a single image of the target can be used for
matching, as a synthetic, compatible 360° set is generated
from each image. Thus, we explored how the correct recognition
rate is affected by gradual removal of an increasing
number of tracked templates. Figure 5 summarizes the results
obtained and shows that the proposed method exhibits
slow, graceful performance degradation.



Figure 5. Decay of correct identification rate as the amount of data
used for training is reduced. Graceful degradation is demonstrated
with a high recognition rate even when 40% of the data is
discarded.


4. Conclusions 

This paper presented preliminary experiments and findings
for target recognition from unmanned reconnaissance
aircraft using a method based on canonical correlations, a
well known statistical method for comparing sets of high
dimensional vectors. A framework for employing canonical
correlations in this scenario was described, followed by
a description of experiments conducted to assess its effectiveness.
These preliminary results show promising performance
with a high correct recognition rate and graceful performance
decay as the amount of training data is
reduced.

4.1. Future work 

Results reported here encourage and call for more experimental
and research effort in the development of a canonical
correlations-based target recognition system. Directions of
particular interest include:

• Data variation modelling: In this paper we only evaluated
the performance of canonical correlations using
linear subspaces. This approach is almost always inferior
to one which takes into account the nonlinear nature
of object appearance manifolds. Specifically, we
would like to investigate the performance of a method
which would represent each 360° appearance variation
as a set of subspaces, which are then mutually compared
in a manner similar to that described in previous work.
Another promising direction involves the use of probabilistic
extensions of canonical correlations.

• Different appearance representation: Here we only
explored the use of raw image appearance to model
target appearance. The use of more complex representations,
e.g. based on oriented gradients,
robust edges or colour invariants could
result in increased robustness of the method and possibly
remove the need for synthetic data augmentation
over full 360°.